Given the escalating effects of climate change, understanding its impact on infectious diseases is of critical importance. This study explores the relationship between climate change and influenza incidence by analyzing historical disease rates and meteorological data spanning 1997 to 2024. The primary objectives were to identify correlations between climate factors and flu incidence, develop and validate predictive models, and assess their ability to forecast influenza trends beyond one year. Two phases of analysis, yearly and monthly, demonstrated different correlations for the percentage of the population testing positive for flu, as well as varying accuracy and generality of the predictive models. The monthly analysis revealed stronger correlations, and the monthly models outperformed the yearly models. Future projections suggest a decrease in influenza incidence alongside an increase in extreme weather patterns over the coming decades. Future research incorporating more comprehensive meteorological data and accurate population forecasting is imperative.
Along with commonly known effects of climate change, such as melting polar ice caps, extreme temperatures, and rising ocean levels, there is a less well-known possibility: shifts in infectious disease dynamics (Flahault et al., 2016). The nexus between global climate change and the incidence of infectious diseases has emerged as a critical area of study; however, the relationship between climate change and respiratory infections such as influenza remains less understood. A thorough understanding of this relationship is crucial for enhancing the preparedness and resilience of medical and public health systems against annual and pandemic-level influenza threats (Lane et al., 2022).
Evidence has emerged that infectious disease incidence is influenced by seasonal drivers such as temperature, humidity, and contact patterns (Axelsen et al., 2014). The complexity of these factors makes it difficult to produce and interpret long-term predictions of infectious disease with any degree of accuracy. Consequently, conclusions about the effects of climate change on influenza incidence and prevalence vary. A 2021 study (at the time not yet peer reviewed) concluded that climate change “generally acts to reduce the intensity of influenza epidemics” because increased specific humidity has a buffering effect on respiratory disease. However, the reduction in epidemic intensity is offset by an increased persistence of seasonal epidemics; although the severity of influenza incidence might decrease, these epidemics would become constant (Baker et al., 2021). These factors also vary between states: each state has a humidity threshold beyond which influenza incidence increases significantly (Serman et al., 2022; Lee & Wang, 2022). Influenza incidence also consistently peaks in the colder winter months (Axelsen et al., 2014).
There is a very real and present concern that as temperatures become more extreme with climate change, epidemics and pandemics will become more frequent (Bolles, 2024). Record-breaking heat, flooding, droughts, and heavy rains all result from warming cycles that have been intensifying since the Industrial Revolution (Bolles, 2024). Beyond the direct rise in mortality from extreme temperatures, infectious disease epidemics will change as well, causing significant effects in areas that historically see little incidence, or worsening the intensity of existing patterns (Ebi et al., 2021).
Current predictive models that relate climate and influenza have very little predictive power beyond one year (Axelsen et al., 2014). Many models that predict future influenza outbreaks focus on novel strains that may arise, and these models inform decisions about which flu vaccines to distribute to the public. If these models are incorrect, there are significant implications for mortality, as was seen with the H1N1 pandemic in 2009 (Ebi et al., 2021; Jilani et al., 2024). Previous models have accurately predicted influenza incidence in relatively stable climatic regions, such as tropical and temperate climates (Morris et al., 2018). While the broad application of predictive models has been validated, given the many factors embedded in both meteorological and incidence data, creating an accurate model that can predict years into the future is imperative.
The objective of this project is to analyze historical influenza data to identify seasonal patterns and critical factors that affect the severity and timing of influenza outbreaks. Additionally, predictive models to forecast future influenza trends and outbreaks will be developed and validated, with the aim of producing forecasts that extend beyond one year. Three hypotheses are examined in this study:
There will be a correlation between meteorological data and flu incidence.
The testing and validation of multiple predictive models will validate their use in this domain.
These models will be able to predict future data beyond one year.
Two stages of testing were conducted to identify what data is needed to accurately predict future epidemics and to inform reporting standards for current flu incidence. The first stage used yearly data aggregated from each month: the average temperature for each year was evaluated against the total incidence in that year to determine future incidence rates for the years 1997 - 2024. The second stage evaluated monthly data: the total incidence, average temperature, and percent of the population positive for influenza were analyzed per month for the years 1997 - 2024.
First, the data were cleaned, parsed, and prepared as outlined below. Then, trends in both datasets were examined using graphical visuals made with plotly in R. After individual analysis of both datasets, they were combined into the dataframes used for model training and evaluation. These relationships were also examined using plotly in R.
Four functions were used to calculate and evaluate the model metrics. Residual QQ plots, error distributions, and feature importance heatmaps were visualized for each model. Five different predictive models (linear regression, random forest, support vector machine, gradient boosting, and elastic net models) were developed to include meteorological factors and predict seasonal influenza activity. Coefficient of determination (R2), root mean squared error (RMSE), mean absolute error (MAE), mean squared error (MSE), mean absolute percentage error (MAPE), and the correlation between the prediction and the test target were used to evaluate the models’ performance. Additionally, learning curves of each model were used to determine optimal parameters. Percent positive was used as the variable to predict, as it takes into account the differences in population growth.
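The metric calculation can be sketched as a single helper; the name `calc_metrics` and its structure are illustrative assumptions, not the exact code used in the analysis:

```r
# Illustrative sketch of a metrics helper (assumed structure).
# Computes the six evaluation metrics from predictions and actual values.
calc_metrics <- function(predictions, actuals) {
  residuals <- actuals - predictions
  mse  <- mean(residuals^2)
  rmse <- sqrt(mse)
  mae  <- mean(abs(residuals))
  r2   <- 1 - sum(residuals^2) / sum((actuals - mean(actuals))^2)
  mape <- mean(abs(residuals / actuals)) * 100  # undefined if any actual is 0
  corr <- cor(predictions, actuals)
  data.frame(Correlation = corr, R2 = r2, RMSE = rmse,
             MAE = mae, MSE = mse, MAPE = mape)
}
```

Note that MAPE is undefined whenever an actual value is zero, which matters for months with a near-zero percent positive.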
Before model creation began, the correlations of the variables in question were examined to ensure that the models accurately reflected and predicted the data. Both Pearson and Spearman correlations were used to examine linear and rank-order relationships, respectively. These correlations informed the partitions used to split the training and testing sets so that both sets had approximately the same initial correlation between the two variables.
After the models were evaluated for prediction accuracy and validated for use in this domain, future meteorological data were synthetically generated using a linear regression model fit to the trends in the historical data. These future data were then used to predict influenza incidence for the years 2025 - 2050. Future predictions of temperature and percent positive rates were made using the top performers from the previous analysis, and these were evaluated to determine whether predictive accuracy extends beyond one year.
Using these statistical techniques, we will be able to draw conclusions about the relationships between meteorological data and influenza incidence, validating the use of predictive modeling in this domain and informing future public health policy implementations.
There were 5 primary data sets used in this analysis across two domains of information: meteorological data and influenza incidence rates.
The meteorological data was collected from the National Centers for Environmental Information at the National Oceanic and Atmospheric Administration (National Centers for Environmental Information & National Oceanic and Atmospheric Administration, 2024). Three datasets were used here: climate codes, climate data maximum, and climate data minimum.
“Climate codes” describes the record format for the state and national files.
“Climate data maximum” gives the maximum temperature in Fahrenheit for each month of the year for years 1895 - 2024.
“Climate data minimum” gives the minimum temperature in Fahrenheit for each month of the year for years 1895 - 2024.
These dataframes were joined, cleaned to remove NaN values, and matched against the climate code dataset to ensure that state names and codes were assigned correctly.
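The join and cleaning step might look like the following base-R sketch; the toy data frames and column names (`state_code`, `temp_max`, `temp_min`) are assumptions, since the NOAA record format is not reproduced here:

```r
# Toy stand-ins for the three NOAA files; real column names may differ.
climate_codes <- data.frame(state_code = "110", state_name = "National")
climate_max   <- data.frame(state_code = "110", year = 1997, month = 1, temp_max = 45.2)
climate_min   <- data.frame(state_code = "110", year = 1997, month = 1, temp_min = 24.8)

# Join maximums and minimums, attach names via the climate code file,
# drop missing rows, and compute a monthly average temperature (°F).
climate <- merge(climate_max, climate_min, by = c("state_code", "year", "month"))
climate <- merge(climate, climate_codes, by = "state_code")
climate <- climate[!is.na(climate$temp_max) & !is.na(climate$temp_min), ]
climate$average_temp <- (climate$temp_max + climate$temp_min) / 2
```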
Influenza incidence data was collected from the World Health Organization and CDC National Respiratory and Enteric Virus Surveillance System (World Health Organization et al., 2024). Two main datasets from this repository were used: WHO_NREVSS_Clinical_Labs and WHO_NREVSS_Combined_prior_to_2015_16.
“WHO_NREVSS_Clinical_Labs” contained weekly data from the years 2015 - 2024 about the total incidence from influenza tests that were sent in for verification and the percent of the population that tested positive for influenza.
“WHO_NREVSS_Combined_prior_to_2015_16” contained weekly data from the years 1997 - 2014 about the total incidence from influenza tests that were sent in for verification and the percent of the population that tested positive for influenza.
It is important to note that these datasets include only the positive influenza tests that were reported, potentially excluding unreported flu cases. For this analysis, both datasets contained incidence rates for Influenza A and Influenza B; however, only the total specimen count and total percent positive were used, to capture the whole picture of influenza infection rates.
Both dataframes with the climate maximums and minimums were read in and cleaned of NaN values. Then, only national values were retained and the averages for each month were computed. This makes the climate data directly comparable to the influenza data.
Both flu datasets were read in and cleaned. Weeks were parsed into months and assigned to align with the climate data. Both total incidence and percent positive were included in this dataset for analysis, but percent positive is the primary variable used.
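Parsing surveillance weeks into months might be sketched as below; the midpoint heuristic is an assumption, since the exact week-numbering convention used in the cleaning code is not shown:

```r
# Sketch: map a surveillance (year, week) pair to a calendar month by
# taking the month of the week's approximate midpoint.
week_to_month <- function(year, week) {
  midpoint <- as.Date(paste0(year, "-01-01")) + (week - 1) * 7 + 3
  as.integer(format(midpoint, "%m"))
}
```

Under this heuristic, for example, week 40 of 1997 falls in October, consistent with the first monthly rows shown later.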
Examining the variables at the outset can shed some light on the patterns of both influenza rates and seasonal temperatures. Topics of interest include seasonal patterns in both variables, overall trends over the past ~20 years, and visual depictions of the changes that have occurred over time.
Figure 1: Climate and influenza data per year from 1895 - 2024. Due to data restrictions, influenza data is unavailable before 1997.
By examining the seasonal trends, we can see that temperature (unsurprisingly) increases during the summer months of June, July, and August and decreases during the winter months of January, February, November, and December. Total influenza incidence has significantly increased in recent years. Temperature trends also show increases in averages across all months.
Figure 2: Normalized climate and influenza trends from 1997 - present with trend lines denoted with dashed lines.
Both influenza and temperature trends from the past 20 years have increased.
Figure 3: Climate stripes from 1895 - 2024 with color representing the average temperature in that year. Red lines denote major climatic events: the 1930 Dust Bowl and the 1998 Super El Niño. This visualization shows that over the past ~100 years, average yearly temperatures have significantly increased.
Figure 4: Influenza stripes from 1997 - 2024 with color representing the average flu incidence in that year. Red lines denote major epidemics: H1N1 in 2009 and COVID-19 in 2020. Over this period, average flu incidence has significantly increased.
## # A tibble: 6 × 4
## year_of_record yearly_avg yearly_total_incidence yearly_percent_positive
## <dbl> <dbl> <dbl> <dbl>
## 1 1997 52.2 33861 3.45
## 2 1998 54.2 91842 4.58
## 3 1999 53.9 117499 8.35
## 4 2000 53.2 95293 3.55
## 5 2001 53.7 98801 4.39
## 6 2002 53.2 121742 6.88
The first dataset that will be used is the combined dataset with the yearly averages.
## # A tibble: 6 × 5
## year_of_record month total_incidence average_temp percent_positive
## <fct> <fct> <dbl> <dbl> <dbl>
## 1 1997 october 6025 53.9 0.560
## 2 1997 november 10982 40.9 0.747
## 3 1997 december 16854 34.0 8.48
## 4 1998 january 23721 35.0 23.5
## 5 1998 february 21034 38.7 21.4
## 6 1998 march 10109 41.3 6.70
The second dataset that will be used is the combined dataset with the monthly averages.
Figure 5: Yearly relationships between average temperature and average percent of the population positive for influenza. Color denotes the year of record.
Figure 6: Monthly relationships between average temperature and average percent of the population positive for influenza each year. Color denotes the month.
The monthly data supports a significantly richer analysis. In both the yearly and monthly data, 2009 is a significant outlier due to the H1N1 pandemic (a strain of Influenza A). Pandemics like this can inform possible temporal relationships, showing that such events are rare but influential.
Numerous functions were used to develop and analyze the models.
The “Metric Function” calculates the RMSE, MAE, R2, MSE, MAPE, and correlation between the predictions and the test target (percent positive, predicted from average temperature).
The “Residual QQ Plot Function” plots the residuals of the model it is given.
The “Error Distribution Heatmap” function plots the error distribution heat map of the model it is given.
The “Feature Importance Heatmap” function plots a heatmap of the important features to the model’s performance.
These functions are applied to every model to analyze its performance. Not all graphs are included in the final analysis.
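As an illustration, the residual QQ plot helper might look like the following sketch; the function name, arguments, and return value are assumptions rather than the authors' exact code:

```r
# Sketch of the residual QQ plot helper (assumed structure).
plot_residual_qq <- function(model, test_data, target) {
  preds     <- predict(model, newdata = test_data)
  residuals <- test_data[[target]] - preds
  qqnorm(residuals, main = "Residual QQ Plot")
  qqline(residuals, col = "red")
  invisible(residuals)  # returned invisibly for programmatic checks
}
```

Points falling along the reference line indicate approximately normal residuals, which is the visual check the analysis relies on.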
First, as mentioned in methods, the aggregated yearly data will be examined and models will be trained on the yearly data.
To begin, it is important to look at the correlations between the percent of the population that tests positive for influenza and the average yearly temperature.
# Total Incidence
correlation_pearson_ti <- cor(combined_data$yearly_total_incidence, combined_data$yearly_avg, use = "complete.obs", method = "pearson")
correlation_spearman_ti <- cor(combined_data$yearly_total_incidence, combined_data$yearly_avg, use = "pairwise.complete.obs", method = "spearman")
cat("Total Incidence Correlation:\nPearson correlation: ", correlation_pearson_ti, "\nSpearman correlation: ", correlation_spearman_ti, "\n")
## Total Incidence Correlation:
## Pearson correlation: 0.4091401
## Spearman correlation: 0.3475643
# Percent Positive
correlation_pearson_pp <- cor(combined_data$yearly_percent_positive, combined_data$yearly_avg, use = "complete.obs", method = "pearson")
correlation_spearman_pp <- cor(combined_data$yearly_percent_positive, combined_data$yearly_avg, use = "pairwise.complete.obs", method = "spearman")
cat("\nPercent Positive Correlation:\nPearson correlation: ", correlation_pearson_pp, "\nSpearman correlation: ", correlation_spearman_pp)
##
## Percent Positive Correlation:
## Pearson correlation: -0.2394929
## Spearman correlation: -0.2134647
The preliminary correlation analysis is informative. Total incidence shows a moderate positive relationship in both the linear (Pearson) and rank-order (Spearman) sense. However, percent positive shows a weak negative relationship on both measures. The Pearson correlation tends to be higher when the relationship is more linear, while the Spearman correlation tends to be higher when a monotonic ranking holds, irrespective of linearity. In the yearly data, neither relationship is particularly strong.
First, the training and testing data are evaluated against each other to see if their correlations between the two variables in question (yearly_percent_positive and yearly_avg) are relatively close. This tells us whether the testing data is similar to the training data, and gives us an idea of the models’ likely performance.
correlation_result_yearly1 <- calculate_correlation(test_data, "yearly_avg", "yearly_percent_positive", "Test Data 1")
## Dataset: Test Data 1
## Correlation between yearly_avg and yearly_percent_positive :
## Correlation: -0.7739
correlation_result_yearly2 <- calculate_correlation(train_data, "yearly_avg", "yearly_percent_positive", "Training Data 1")
## Dataset: Training Data 1
## Correlation between yearly_avg and yearly_percent_positive :
## Correlation: -0.2019
Figure 7: Heatmap of the correlations in the dataframe ‘combined_data’. The correlation between the yearly average temperature and the average yearly percent of the population positive for influenza is -0.23 whereas the correlation between the yearly average temperature and the yearly total incidence of influenza is 0.40.
Using a partition of 0.8, meaning that 80% of the data is used for training and 20% is used for testing, the correlation for the testing data is approximately -0.78 while the correlation for the training data is -0.20.
Using a partition of 0.4, meaning that 40% of the data is used for training and 60% is used for testing, the correlation for the testing data is approximately -0.408 while the correlation for the training data is -0.15.
Comparing the correlation between percent_positive and yearly_avg (-0.23) with the training and testing splits, the best option is a partition of 0.8, in order to give the model the optimal chance to make accurate predictions.
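The partition check described above can be sketched in base R; `compare_split` is an illustrative name, and the exact splitting routine used (for example, a stratified partition from caret) may differ:

```r
# Sketch: split a dataframe at proportion p, then compare the
# train/test correlations between two columns (assumed procedure).
compare_split <- function(df, x, y, p) {
  idx   <- sample(nrow(df), size = floor(p * nrow(df)))
  train <- df[idx, ]
  test  <- df[-idx, ]
  c(train_cor = cor(train[[x]], train[[y]]),
    test_cor  = cor(test[[x]],  test[[y]]))
}
```

Running this across candidate values of p and keeping the split whose correlations best match the full-data correlation mirrors the selection procedure described here.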
## Model Correlation R2 RMSE MAE MSE MAPE
## 1 Linear Regression 0.7738915 0.59890811 3.629235 2.873273 13.17134 2.772481
## 2 Random Forest -0.7356542 0.54118707 4.379012 3.621421 19.17575 2.845330
## 3 Gradient Boosting -0.1111647 0.01235759 3.741804 2.892591 14.00110 2.523603
## 4 SVM -0.8925786 0.79669656 4.876608 4.003106 23.78131 3.247605
## 5 Elastic Net 0.7738915 0.59890811 3.907454 3.077875 15.26820 2.971392
Here, linear regression and elastic net showed the strongest positive correlations and similar R2, RMSE, and MAE values. SVM had the strongest negative correlation but the highest R2 value, suggesting that it was overfit.
Figure 8: The actual vs. predicted values from the models trained on yearly data. The ideal fit is denoted by a red dotted line, and the closer the values are to the line, the better the models’ performance.
Here, the Linear Regression, SVM, and Gradient Boosting models visually look good. However, further analysis is required to determine the best option.
Figure 9: The residuals from the models vs the fitted values. The more randomly scattered the data points are around the 0 line, the better the model performance.
Here there is a slight negative linear trend, possibly suggesting that the relationship between the variables yearly_avg_temp and yearly_percent_positive is non-linear.
Figure 10: A plot of the correlations between the actual and predicted values for each model (how well the model’s predictions matched the actual values). More positive values indicate a better model performance.
Linear Regression and Elastic net are equally as accurate in this analysis, with gradient boosting, random forest, and SVM performing poorly.
Figure 11: A visual of the most important factors contributing to the models’ ability to predict the average percent positive per year from the average yearly temperature.
The most important years were 2004, 2009, 2012, 2014 and 2023. These signify the years that have the highest percentage of people testing positive for the flu. Intuitively, 2009 is the highest temporal predictor. However, the yearly average temperature did not contribute significantly to the model’s performance.
In evaluating the performance of different predictive models, it is important to see how they learn and perform with different partitions of training sets.
Figure 12: The learning curves for the 5 different models on the aggregated yearly data across different partitions of training data.
There are numerous insights to draw from inspecting the learning curves. Training error typically decreases as the training set grows, since the model has more information to learn from. Validation error may also decrease with more training data as the model generalizes better, and it plateaus when additional data no longer helps. Overfitting is signified when training error is low but validation error remains high (the model complexity is too high compared to the amount of data); underfitting is signified when both training and validation errors remain high. Ideally, the training and validation errors are low and converge as the training set size increases, signifying that the model is neither overfitting nor underfitting the data.
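A learning-curve loop of the kind described might be sketched as follows, using a fixed hold-out set and growing training fractions; the fraction grid, the RMSE metric, and the use of `lm` as the example learner are assumptions:

```r
# Sketch: record train/validation RMSE as the training fraction grows.
learning_curve <- function(df, formula, target, fracs = seq(0.2, 0.8, by = 0.1)) {
  rmse    <- function(a, p) sqrt(mean((a - p)^2))
  val_idx <- sample(nrow(df), size = floor(0.2 * nrow(df)))  # fixed hold-out
  val     <- df[val_idx, ]
  pool    <- df[-val_idx, ]
  t(sapply(fracs, function(f) {
    train <- pool[sample(nrow(pool), floor(f * nrow(pool))), ]
    fit   <- lm(formula, data = train)
    c(frac       = f,
      train_rmse = rmse(train[[target]], predict(fit, newdata = train)),
      val_rmse   = rmse(val[[target]],   predict(fit, newdata = val)))
  }))
}
```

Plotting `train_rmse` and `val_rmse` against `frac` yields curves of the shape discussed above: convergence at a low error indicates a well-fit model.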
The yearly learning curves reinforce that linear regression is the best fit for the data: its curves converge at a relatively low error. Elastic net shows similar results, aligning with the previous conclusion that these were the two best models for the yearly data. Interestingly, all models perform relatively strongly at a partition of 0.4.
Second, as mentioned in the methods, the aggregated monthly data will be examined and models will be trained on it, to determine whether accuracy differs with more data.
First, it is important to look at the correlations between the percent of the population that tests positive for influenza and the average monthly temperature.
set.seed(1998)
combined_data2$total_incidence <- as.numeric(combined_data2$total_incidence)
combined_data2$average_temp <- as.numeric(combined_data2$average_temp)
combined_data2$percent_positive <- as.numeric(combined_data2$percent_positive)
# Total Incidence
correlation_pearson_ti2 <- cor(combined_data2$total_incidence, combined_data2$average_temp, use = "complete.obs", method = "pearson")
correlation_spearman_ti2 <- cor(combined_data2$total_incidence, combined_data2$average_temp, use = "complete.obs", method = "spearman")
cat("Total Incidence Correlation:\nPearson correlation: ", correlation_pearson_ti2, "\nSpearman correlation: ", correlation_spearman_ti2, "\n")
## Total Incidence Correlation:
## Pearson correlation: -0.235857
## Spearman correlation: -0.3803802
# Percent Positive
correlation_pearson_pp2 <- cor(combined_data2$percent_positive, combined_data2$average_temp, use = "complete.obs", method = "pearson")
correlation_spearman_pp2 <- cor(combined_data2$percent_positive, combined_data2$average_temp, use = "complete.obs", method = "spearman")
cat("\nPercent Positive Correlation:\nPearson correlation: ", correlation_pearson_pp2, "\nSpearman correlation: ", correlation_spearman_pp2)
##
## Percent Positive Correlation:
## Pearson correlation: -0.5626628
## Spearman correlation: -0.6271178
If we analyze total incidence, both the Pearson and Spearman correlations are negative, which indicates an inverse relationship with average temperature (as temperature goes down, incidence goes up). The relationship is weakly negative linearly (Pearson) but moderately negative when comparing rank order (Spearman).
Examining percent_positive, both the Pearson and Spearman correlations are strongly negative both linearly and when considering rank order. This indicates that there is a stronger inverse relationship between percent_positive and average temperature than with total incidence. We expect to see this with monthly data.
The training and testing data are evaluated against each other to see if their correlations between the two variables in question (percent_positive and average_temp) are relatively close. This tells us whether the testing data is similar to the training data, and gives us an idea of the models’ likely performance.
correlation_result_monthly1 <- calculate_correlation(test_data2, "average_temp", "percent_positive", "Test Data 2")
## Dataset: Test Data 2
## Correlation between average_temp and percent_positive :
## Correlation: -0.5353
correlation_result_monthly2 <- calculate_correlation(train_data2, "average_temp", "percent_positive", "Train Data 2")
## Dataset: Train Data 2
## Correlation between average_temp and percent_positive :
## Correlation: -0.5732
Figure 13: Heatmap of the correlations in the dataframe ‘combined_data2’. The correlation between the monthly average temperature and the monthly percent of the population positive for influenza is -0.54, whereas the correlation between the monthly average temperature and the monthly total incidence of influenza is -0.25.
The correlations of the testing dataset (-0.53) and the training dataset (-0.57) match closely with each other, and with the actual correlation of -0.57, when using a partition of 0.7, meaning that the training and testing sets are comparable. This makes sense, as there is more data to draw from with monthly aggregates than with yearly data. Although the typical training partition is 80%, due to the high comparability between the training and testing correlations, the following models are trained on the 70/30 partition.
## Model Correlation R2 RMSE MAE MSE
## 1 Linear Regression 0.7291544 0.5316661 6.466042 4.750917 41.80970
## 2 Random Forest 0.8309889 0.6905425 4.818461 3.036457 23.21757
## 3 Gradient Boosting 0.6056588 0.3668226 6.889429 4.924894 47.46423
## 4 SVM 0.7470646 0.5581056 5.766884 4.033250 33.25695
## 5 Elastic Net 0.7274193 0.5291388 6.310291 4.587377 39.81977
The random forest model stands out as the best performer across all metrics, with the highest correlation and R2 and the lowest error values. The gradient boosting model did not perform well with the monthly data. SVM, linear regression, and elastic net are all close behind, with SVM performing slightly better than the other two across metrics.
Figure 14: The actual vs. predicted values from the models trained on monthly data. The ideal fit is denoted by a red dotted line, and the closer the values are to the line, the better the models’ performance.
All models perform relatively well here. Based on visual inspection, linear regression and elastic net fit best. Gradient boosting struggles to predict values above 16.
Figure 15: The residuals from the models vs the fitted values. The more randomly scattered the data points are around the 0 line, the better the model performance.
Here there is a negative linear trend, suggesting that the relationship between the variables average_temp and percent_positive is non-linear.
Figure 16: A plot of the correlations between the actual and predicted values for each model. More positive values indicate a better model performance.
All models performed well, however, Random Forest performed slightly better among all options.
Figure 17: A visual of the important factors that contributed to the model being able to predict the percent_positive.
The most important features were average_temp, the years 2009, 2012, and 2021, and the months February, January, and October.
As with the yearly models, identifying how the models learn over different partitioned training sets is important to understand their behavior and inform future recommendations for partitions.
Figure 18: The learning curves for the 5 different models on the monthly data across different partitions of training data. The lower the error, the better the model performance. These learning curves validate the previous results, showing Random Forest having the best performance.
The best-performing models here are the random forest and elastic net models, mirroring the earlier results from the error distributions, correlation tests, and metric analysis. The monthly models had higher absolute errors than the yearly models, but overall achieved higher performance and prediction accuracy.
Future predictions were calculated by forecasting the average temperature from the existing trend line using a linear regression model. Then, these temperatures were used in 4 models (linear regression, SVM, XGBoost, and random forest) to predict the percent positive rate for the years 2025 - 2050.
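The two-stage forecast can be sketched with linear models as follows; the toy `combined_data` is a stand-in for the real yearly aggregates, and the simple `lm` pipeline is an assumption about the structure rather than the exact code:

```r
# Toy stand-in for the yearly aggregates (assumed shape of combined_data).
set.seed(1998)
combined_data <- data.frame(
  year_of_record          = 1997:2024,
  yearly_avg              = 52 + 0.05 * (0:27) + rnorm(28, sd = 0.5),
  yearly_percent_positive = 6 - 0.02 * (0:27) + rnorm(28, sd = 1)
)

# Stage 1: extrapolate the average-temperature trend to 2025 - 2050.
temp_trend <- lm(yearly_avg ~ year_of_record, data = combined_data)
future <- data.frame(year_of_record = 2025:2050)
future$yearly_avg <- predict(temp_trend, newdata = future)

# Stage 2: feed the forecast temperatures into a fitted incidence model.
flu_model <- lm(yearly_percent_positive ~ yearly_avg, data = combined_data)
future$predicted_percent_positive <- predict(flu_model, newdata = future)
```

Because the forecast temperatures come from a model rather than observations, any bias in the stage-1 trend propagates directly into the stage-2 incidence predictions.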
Figure 19: This figure shows the predicted values of average yearly percent positive and average yearly temperature from the years 2025 - 2050.
Unsurprisingly, the random forest and SVM models both had sub-par performance, with predicted incidence rates remaining unchanged despite year and temperature differences. As linear regression showed the highest performance in the initial training, we can infer that it is the most accurate model for using yearly data to predict future flu epidemics. The linear regression model shows the average yearly percentage of the population testing positive for influenza decreasing from 6.59% in 2025 to 5.29% in 2050. However, linear regression may not capture all of the important factors that influence this prediction.
Given the possible limitations with using only aggregated yearly data to predict future epidemics, it is important to look at how the models perform when monthly data is used.
Figure 20: This figure shows the predicted values of average monthly percent positive and average monthly temperature for the years 2025 - 2050 using a linear regression model. Color signifies the month.
This model performs relatively well: temperature trends follow the expectation of warmer months getting warmer and colder months getting colder, reinforcing that climate change drives more extreme weather. Additionally, the percent positive rates of influenza decrease with time, aligning with previous literature on the topic. This model also extends its predictions well beyond one year, a limitation of previous research on this topic.
As random forest performed well when validating, a random forest model will also be evaluated here to examine its performance on future predictions.
Figure 21: This figure shows the predicted values of average monthly percent positive and average monthly temperature for the years 2025 - 2050 using a random forest model. Color signifies the month.
This model performs well up until 2028; after 2028, however, the predicted values do not change. This signifies good short-term predictive performance but sub-par long-term predictions, aligning with previous research on the limitations of similar predictive models beyond one year.
The relationship between climate change and influenza incidence is a pressing yet understudied area with the potential for significant public health implications. This study explored the relationship by leveraging historical influenza and meteorological datasets to develop predictive models, with the objectives of identifying potential correlations between climate factors and flu incidence, testing and validating various predictive models, and evaluating the models’ ability to forecast influenza trends beyond one year.
Two main phases of the project, yearly and monthly, yielded different, yet important results.
During the yearly data analysis, both the Pearson and Spearman correlations for total annual influenza incidence were positive but moderate, indicating that as average temperature increases, total annual influenza incidence also increases. However, the correlations for the percentage of the population testing positive were weaker and negative, suggesting a more complex and perhaps non-linear relationship between temperature and influenza incidence.
While training predictive models on yearly data, linear regression and elastic net models exhibited the strongest performance, with relatively high R2 values and moderate MAEs. The support vector machine showed the highest R2 value but also exhibited signs of overfitting, reflected in its higher RMSE and lower overall correlation.
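The three evaluation metrics used throughout this comparison have simple closed forms. A minimal sketch, computed on synthetic values rather than the study's model outputs:

```python
import numpy as np

def rmse(y_true, y_pred):
    # Root mean squared error: penalizes large residuals quadratically.
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

def mae(y_true, y_pred):
    # Mean absolute error: average magnitude of the residuals.
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def r2(y_true, y_pred):
    # Coefficient of determination: 1 - residual SS / total SS.
    y_true = np.asarray(y_true, float)
    ss_res = np.sum((y_true - np.asarray(y_pred)) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# A least-squares line fit on synthetic yearly data (stand-in values).
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0 + np.array([0.1, -0.2, 0.05, 0.0, 0.3, -0.1, 0.2, -0.3, 0.1, 0.0])
slope, intercept = np.polyfit(x, y, 1)
y_hat = slope * x + intercept
print(rmse(y, y_hat), mae(y, y_hat), r2(y, y_hat))
```

Note that RMSE is always at least as large as MAE, which is why the two are reported together: a large gap between them signals a few unusually large errors.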
The learning curves showed that all the models improved as the training set size increased, although linear regression and elastic net models converged more quickly and displayed lower errors overall, indicating that these models were likely more effective in capturing the yearly trends in the data within the range used.
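A learning curve of the kind described above can be produced by refitting a model on progressively larger training subsets and tracking error on a fixed held-out set. The following is a minimal NumPy sketch with synthetic data; the study's actual curves came from its own models and datasets:

```python
import numpy as np

rng = np.random.default_rng(7)
x = np.linspace(0.0, 10.0, 80)
y = 3.0 * x + 5.0 + rng.normal(0.0, 1.0, x.size)

# Shuffle once, hold out 20 points, then train on growing subsets.
idx = rng.permutation(x.size)
train_idx, val_idx = idx[:60], idx[60:]
val_errors = []
for n in (10, 20, 40, 60):
    sub = train_idx[:n]
    slope, intercept = np.polyfit(x[sub], y[sub], 1)  # simple linear fit
    resid = y[val_idx] - (slope * x[val_idx] + intercept)
    val_errors.append(float(np.sqrt(np.mean(resid ** 2))))  # validation RMSE
print(val_errors)
```

Plotting validation error against training-set size yields the curve; convergence toward the noise floor as n grows is the behavior the yearly models displayed.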
When examining monthly aggregates, the correlations between average temperature and total incidence, as well as percentage positive, were generally strong and negative. This signifies a more pronounced and consistent inverse relationship at the monthly level compared to yearly aggregates.
All models showed improved performance when trained on monthly data. The random forest model, in particular, outperformed others in terms of RMSE, MAE, and R2, indicating its robustness in capturing the nuances of monthly influenza patterns. This suggests that models benefit from the finer granularity of monthly data, which likely captures seasonal variations better than yearly data.
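One hedged sketch of how such a monthly comparison might be set up, using scikit-learn on synthetic seasonal data standing in for the study's monthly aggregates (the threshold term is an illustrative non-linearity, loosely inspired by the humidity-threshold findings cited earlier):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic monthly series: percent positive driven by a seasonal
# (cosine) temperature cycle plus a cold-threshold effect.
rng = np.random.default_rng(3)
months = np.arange(240)                    # 20 years of monthly data
temp = 55.0 + 20.0 * np.cos(2 * np.pi * months / 12.0)
pct_pos = (10.0 - 0.12 * temp
           + 0.4 * np.maximum(0.0, 40.0 - temp)   # extra flu below 40 degrees
           + rng.normal(0.0, 0.3, months.size))

X = np.column_stack([temp, months % 12])   # temperature + month-of-year
X_tr, X_te, y_tr, y_te = X[:180], X[180:], pct_pos[:180], pct_pos[180:]

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)
lin = LinearRegression().fit(X_tr, y_tr)
rmse_rf = mean_squared_error(y_te, rf.predict(X_te)) ** 0.5
rmse_lin = mean_squared_error(y_te, lin.predict(X_te)) ** 0.5
print(f"RF RMSE = {rmse_rf:.3f}, Linear RMSE = {rmse_lin:.3f}")
```

Because the forest can split on the cold-threshold region while the linear model cannot, this setup illustrates why tree ensembles tend to benefit from the finer granularity and non-linear structure of monthly data.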
The learning curves for monthly data models generally corroborated the finding that model performance improves with larger, more detailed datasets. All models showed a decline in error as the training set size increased, with random forest and support vector machine showing particularly strong performance.
Two future predictions were generated with the monthly models to evaluate their usefulness. First, the linear regression model produced distinct, plausible values for 2025 - 2050 that followed expected trends. Temperatures of the warmer months increased and temperatures of the colder months decreased, consistent with the prediction that climate change causes more extreme temperature fluctuations (Ebi et al., 2021). The total percentage of the population positive for influenza decreased slightly, also in line with previous research (Baker et al., 2021). Overall, the linear model performed well in predicting future temperatures, given current trends, and future flu incidence rates.
Given its superior performance during validation, the random forest model was also employed to predict future influenza trends. Although the random forest predictions initially appear more accurate than the linear model's, their accuracy degrades significantly beyond three years into the future: the predicted values for 2028 - 2050 remained constant for both temperature and flu rates, whereas the values for 2025 - 2028 fluctuated as observed in the original data.
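The plateau after 2028 is consistent with a general property of tree-based models: they cannot extrapolate beyond the range of their training targets, so sufficiently far-future inputs all fall into the same leaves and receive identical predictions. A small scikit-learn sketch of the effect, on toy data:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Train both models on a simple warming trend over 28 "years" (toy data).
years = np.arange(28, dtype=float).reshape(-1, 1)
temps = 50.0 + 0.05 * years.ravel()

rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(years, temps)
lin = LinearRegression().fit(years, temps)

# Predict far beyond the training range.
future = np.array([[40.0], [50.0], [60.0]])
rf_pred = rf.predict(future)
lin_pred = lin.predict(future)

# The forest repeats its rightmost leaf values; the line keeps rising.
print(rf_pred, lin_pred)
```

Every input beyond the largest training year traverses each tree to the same terminal leaf, so the forest's output is flat, exactly the 2028 - 2050 behavior observed here, while the linear model continues the fitted trend.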
The differences in these model predictions and accuracy can be plausibly attributed to three different factors: (1) limited data scope, (2) non-linear relationships, or (3) computational inefficiencies. First, there was a limited scope of available data: the meteorological data did not include specific humidity, elevation patterns, or other potentially important weather variables. Second, the data could be non-linearly related, which would limit the models' ability to predict future values. Finally, computational inefficiencies related to the training and validation of the models, or limits on the computational power available, contributed to the models' limitations.
The findings of this study revealed several critical insights and underscored the intricate relationship between meteorological factors and influenza incidence in the United States.
The observed correlations between meteorological variables and influenza incidences were significant. Yearly and monthly data analysis indicated that influenza rates have an inverse correlation with temperature, with monthly data showing a stronger relationship than yearly data. Specifically, colder temperatures corresponded to higher flu incidences. Seasonal patterns became evident when analyzing monthly data, aligning with existing literature that suggests increased influenza transmission during colder months.
Multiple predictive models were tested using both yearly and monthly data. Linear regression and elastic net models showed strong performance with yearly data, while random forest and support vector machines demonstrated superior performance with monthly data. These models can be used to capture finer seasonal variations, given sufficient monthly data. All models were validated using metrics such as RMSE, MAE, and R2, underscoring the robustness of the predictive modeling, particularly with regard to monthly data.
Using linear regression and random forest models, future influenza incidences from 2025 - 2050 were predicted with varying results. The linear regression model performed consistently well, capturing gradual changes in influenza incidences and temperatures. The random forest model provided reliable short-term predictions, up to three years into the future, but struggled with long-term forecasting, mirroring existing challenges in extending predictive accuracy beyond one year. Overall, linear models were more reliable for long-term forecasts, while random forest models yielded better short-term predictions.
Future reporting standards for climate data and influenza data should focus on monthly, if not weekly, data. This study highlights the crucial intersection between climate change and public health, underscoring the importance of continued research and development of robust predictive models to mitigate future influenza outbreaks in the context of a rapidly changing climate.
There are numerous factors that contributed to the limited scope, generality, and applicability of these predictive models: (1) inherent data limitations, (2) correlation strength, and (3) population trend assumptions.
The influenza data spanned from 1997 - 2024, whereas the climate data extended back to 1895; this mismatch potentially influenced the long-term trend analysis and contributed to the limited accuracy of long-term predictions. Weekly data were aggregated to monthly values, which involved assumptions about the distribution of flu cases within months. Because leap years yield up to 53 reporting weeks, months were assigned the following week indexes:
January: weeks 1 - 4
February: weeks 5 - 8
March: weeks 9 - 12
April: weeks 13 - 16
May: weeks 17 - 21
June: weeks 22 - 25
July: weeks 26 - 29
August: weeks 30 - 33
September: weeks 34 - 38
October: weeks 39 - 43
November: weeks 44 - 48
December: weeks 49 - 53
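This week-to-month assignment can be expressed as a simple lookup table; the sketch below (helper names are illustrative, not from the study's code) also verifies that every reporting week is covered exactly once:

```python
# Week-index ranges assigned to each month (Python ranges exclude the
# upper bound), covering the up-to-53 reporting weeks in leap years.
MONTH_WEEKS = {
    "january": range(1, 5),     "february": range(5, 9),
    "march": range(9, 13),      "april": range(13, 17),
    "may": range(17, 22),       "june": range(22, 26),
    "july": range(26, 30),      "august": range(30, 34),
    "september": range(34, 39), "october": range(39, 44),
    "november": range(44, 49),  "december": range(49, 54),
}

def month_of_week(week_index):
    # Map a reporting-week index (1-53) to its assigned month name.
    for month, weeks in MONTH_WEEKS.items():
        if week_index in weeks:
            return month
    raise ValueError(f"week index out of range: {week_index}")

# Sanity check: weeks 1-53 are partitioned with no gaps or overlaps.
assert sorted(w for r in MONTH_WEEKS.values() for w in r) == list(range(1, 54))
```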
While significant, the correlations between meteorological variables and influenza incidences were not robust enough to explain all of the variation. Possible non-linear relationships may exist that were not captured by the models.
The analysis assumed consistent population trends, somewhat mitigated by the use of percent positive instead of total incidence to analyze influenza data. Future studies should integrate population dynamics into the predictive models for more accurate forecasts.
Future research should take into account the limitations of this study in order to more accurately and comprehensively forecast future influenza epidemics. A wider range of meteorological factors should be incorporated, including but not limited to humidity, elevation, predictable climate patterns like La Niña and El Niño, and data about natural disasters. Exploring non-linear models and machine learning techniques might better capture the intricate relationships between climate and influenza variables. Accurate population projections should be included to adjust for demographic changes and migration patterns. Additionally, leveraging advanced computational power will enable more complex model training and validation, which could improve the long-term predictive capabilities. The addition of these factors will provide a more holistic and comprehensive understanding of future influenza dynamics.
I would like to extend my acknowledgements to Professor Ivo Dinov for his support and instruction throughout the Fall 2024 semester. His impact on my education cannot be overstated. I would also like to give thanks to the staff at the Center for Healthcare Engineering & Patient Safety for providing the use of a loaner computer when my device broke, allowing me to complete this project on time. Finally, recognition is due to the various researchers and scientists whose foundational work this study builds upon.
Axelsen, J. B., Yaari, R., Grenfell, B. T., & Stone, L. (2014). Multiannual forecasting of seasonal influenza dynamics reveals climatic and evolutionary drivers. Proceedings of the National Academy of Sciences, 111(26), 9538–9542. https://doi.org/10.1073/pnas.1321656111
Baker, R. E., Yang, Q., Worby, C. J., Yang, W., Saad-Roy, C. M., Viboud, C., Shaman, J., Metcalf, C. J. E., Vecchi, G., & Grenfell, B. T. (2021). Implications of climatic and demographic change for seasonal influenza dynamics and evolution (p. 2021.02.11.21251601). medRxiv. https://doi.org/10.1101/2021.02.11.21251601
Bolles, D. (2024, October 10). Extreme Weather and Climate Change. NASA Science. https://science.nasa.gov/climate-change/extreme-weather/
Ebi, K. L., Vanos, J., Baldwin, J. W., Bell, J. E., Hondula, D. M., Errett, N. A., Hayes, K., Reid, C. E., Saha, S., Spector, J., & Berry, P. (2021). Extreme Weather and Climate Change: Population Health and Health System Implications. Annual Review of Public Health, 42, 293–315. https://doi.org/10.1146/annurev-publhealth-012420-105026
Extreme Weather and Climate Change. (n.d.). Center for Climate and Energy Solutions. Retrieved December 10, 2024, from https://www.c2es.org/content/extreme-weather-and-climate-change/
Flahault, A., de Castaneda, R. R., & Bolon, I. (2016). Climate change and infectious diseases. Public Health Reviews, 37, 21. https://doi.org/10.1186/s40985-016-0035-2
Jilani, T. N., Jamil, R. T., Nguyen, A. D., & Siddiqui, A. H. (2024). H1N1 Influenza. In StatPearls. StatPearls Publishing. http://www.ncbi.nlm.nih.gov/books/NBK513241/
Lane, M. A., Walawender, M., Carter, J., Brownsword, E. A., Landay, T., Gillespie, T. R., Fairley, J. K., Philipsborn, R., & Kraft, C. S. (2022). Climate change and influenza: A scoping review. The Journal of Climate Change and Health, 5, 100084. https://doi.org/10.1016/j.joclim.2021.100084
Lee, J. J., & Wang, A. (2022, March 4). NASA Finds Each State Has Its Own Climatic Threshold for Flu Outbreaks. Climate Change: Vital Signs of the Planet. https://climate.nasa.gov/news/3150/nasa-finds-each-state-has-its-own-climatic-threshold-for-flu-outbreaks
Morris, D. H., Gostic, K. M., Pompei, S., Bedford, T., Łuksza, M., Neher, R. A., Grenfell, B. T., Lässig, M., & McCauley, J. W. (2018). Predictive modeling of influenza shows the promise of applied evolutionary biology. Trends in Microbiology, 26(2), 102–118. https://doi.org/10.1016/j.tim.2017.09.004
National Centers for Environmental Information, & National Oceanic and Atmospheric Administration. (n.d.-a). /Pub/data/normals/1981-2010/climate_codes [Txt]. NCEI NOAA Data Repository. Retrieved November 20, 2024, from https://www.ncei.noaa.gov/pub/data/normals/1981-2010/
National Centers for Environmental Information, & National Oceanic and Atmospheric Administration. (n.d.-b). /Pub/data/normals/1981-2010/climate_data_maximum [Txt]. NCEI NOAA Data Repository. Retrieved November 20, 2024, from https://www.ncei.noaa.gov/pub/data/normals/1981-2010/
National Centers for Environmental Information, & National Oceanic and Atmospheric Administration. (n.d.-c). /Pub/data/normals/1981-2010/climate_data_minimum [Txt]. NCEI NOAA Data Repository. Retrieved November 20, 2024, from https://www.ncei.noaa.gov/pub/data/normals/1981-2010/
National Centers for Environmental Information (NCEI). (n.d.). Retrieved December 10, 2024, from https://www.ncei.noaa.gov/
Serman, E., Thrastarson, H. Th., Franklin, M., & Teixeira, J. (2022). Spatial Variation in Humidity and the Onset of Seasonal Influenza Across the Contiguous United States. GeoHealth, 6(2), e2021GH000469. https://doi.org/10.1029/2021GH000469
World Health Organization, National Respiratory and Enteric Virus Surveillance System, ILINet, & Centers for Disease Control and Prevention. (2024). National, Regional, and State Level Outpatient Illness and Viral Surveillance (Version 2024) [Dataset; Csv]. FluView. https://gis.cdc.gov/grasp/fluview/fluportaldashboard.html